I am using formula 1 dataset again to plot the most successful drivers

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import numpy as np
In [2]:
df_driver = pd.read_csv("GrandPrix_drivers_details_1950_to_2022.csv", encoding='ISO-8859-1')
df_driver
Out[2]:
Pos Driver Nationality Car PTS year
0 1 Nino Farina FAR ITA Alfa Romeo 30.0 1950
1 2 Juan Manuel Fangio FAN ARG Alfa Romeo 27.0 1950
2 3 Luigi Fagioli FAG ITA Alfa Romeo 24.0 1950
3 4 Louis Rosier ROS FRA Talbot-Lago 13.0 1950
4 5 Alberto Ascari ASC ITA Ferrari 11.0 1950
... ... ... ... ... ... ...
1613 18 Zhou Guanyu ZHO CHN Alfa Romeo Ferrari 6.0 2022
1614 19 Alexander Albon ALB THA Williams Mercedes 4.0 2022
1615 20 Nicholas Latifi LAT CAN Williams Mercedes 2.0 2022
1616 21 Nyck De Vries DEV NED Williams Mercedes 2.0 2022
1617 22 Nico Hulkenberg HUL GER Aston Martin Aramco Mercedes 0.0 2022

1618 rows × 6 columns

In [3]:
#calculating drivers active years
active_years = df_driver.groupby('Driver')['year'].agg(['min', 'max']).reset_index()
active_years['ActiveYears'] = active_years.apply(lambda row: f"{row['min']}-{row['max']}", axis=1)
active_years
Out[3]:
Driver min max ActiveYears
0 Adrian Sutil SUT 2007 2014 2007-2014
1 Aguri Suzuki SUZ 1990 1995 1990-1995
2 Alain Prost PRO 1980 1993 1980-1993
3 Alan Jones JON 1975 1986 1975-1986
4 Alan Brown BRO 1952 1952 1952-1952
... ... ... ... ...
391 Yuji Ide IDE 2006 2006 2006-2006
392 Yuki Tsunoda TSU 2021 2022 2021-2022
393 Yves Giraud Cabantous CAB 1950 1951 1950-1951
394 Zhou Guanyu ZHO 2022 2022 2022-2022
395 Zsolt Baumgartner BAU 2003 2004 2003-2004

396 rows × 4 columns

In [4]:
df_race = pd.read_csv("GrandPrix_races_details_1950_to_2022.csv", encoding='ISO-8859-1')
In [5]:
df_race.head()
Out[5]:
Grand Prix Date Winner Car Laps Time year
0 Great Britain 13 May 1950 Nino Farina FAR Alfa Romeo 70.0 2:13:23.600 1950.0
1 Monaco 21 May 1950 Juan Manuel Fangio FAN Alfa Romeo 100.0 3:13:18.700 1950.0
2 Indianapolis 500 30 May 1950 Johnnie Parsons PAR Kurtis Kraft Offenhauser 138.0 2:46:55.970 1950.0
3 Switzerland 04 Jun 1950 Nino Farina FAR Alfa Romeo 42.0 2:02:53.700 1950.0
4 Belgium 18 Jun 1950 Juan Manuel Fangio FAN Alfa Romeo 35.0 2:47:26.000 1950.0
In [6]:
df_merged = pd.merge(df_race[['Winner', 'Car', 'Date']], df_driver[['Driver', 'Car', 'PTS', 'year']],
                     left_on=['Winner', 'Car'], right_on=['Driver', 'Car'])

df_merged = df_merged[['Winner', 'Car', 'Date', 'PTS']]
print(df_merged.head())
            Winner         Car         Date   PTS
0  Nino Farina FAR  Alfa Romeo  13 May 1950  30.0
1  Nino Farina FAR  Alfa Romeo  13 May 1950  19.0
2  Nino Farina FAR  Alfa Romeo  04 Jun 1950  30.0
3  Nino Farina FAR  Alfa Romeo  04 Jun 1950  19.0
4  Nino Farina FAR  Alfa Romeo  03 Sep 1950  30.0
In [7]:
df_merged['Year'] = pd.to_datetime(df_merged['Date'], format='mixed',dayfirst=True).dt.year
df_merged.head()
# Calculate the total PTS for a driver in each year
#df_total_pts = df_merged.groupby(['Winner', 'Year'])['PTS'].sum().reset_index()
#df_total_pts.head()
Out[7]:
Winner Car Date PTS Year
0 Nino Farina FAR Alfa Romeo 13 May 1950 30.0 1950
1 Nino Farina FAR Alfa Romeo 13 May 1950 19.0 1950
2 Nino Farina FAR Alfa Romeo 04 Jun 1950 30.0 1950
3 Nino Farina FAR Alfa Romeo 04 Jun 1950 19.0 1950
4 Nino Farina FAR Alfa Romeo 03 Sep 1950 30.0 1950
In [8]:
#sum up the column of points for each driver grouping by years
df_merged['PTS'] = df_merged.groupby(['Winner', 'Year'])['PTS'].transform('sum')

# Count the occurrence of each winner in each year
df_merged['Wins'] = df_merged.groupby(['Winner', 'Year'])['Winner'].transform('count')

# Select the final columns
df_final = df_merged[['Winner', 'Year', 'PTS', 'Wins', 'Car']].drop_duplicates()
df_final.head()
Out[8]:
Winner Year PTS Wins Car
0 Nino Farina FAR 1950 147.0 6 Alfa Romeo
6 Nino Farina FAR 1951 49.0 2 Alfa Romeo
8 Juan Manuel Fangio FAN 1950 174.0 6 Alfa Romeo
14 Juan Manuel Fangio FAN 1951 174.0 6 Alfa Romeo
20 Johnnie Parsons PAR 1950 9.0 1 Kurtis Kraft Offenhauser
In [9]:
df_final.sort_values('Year', inplace=True)

# Convert 'Wins' column to cumulative sum
df_final['Wins'] = df_final.groupby('Winner')['Wins'].cumsum()
df_final['PTS'] = df_final.groupby('Winner')['PTS'].cumsum()

# Select the final columns
df_final = df_final[['Winner', 'Year', 'PTS', 'Wins', 'Car']].drop_duplicates()
df_final.head()
Out[9]:
Winner Year PTS Wins Car
0 Nino Farina FAR 1950 147.0 6 Alfa Romeo
8 Juan Manuel Fangio FAN 1950 174.0 6 Alfa Romeo
20 Johnnie Parsons PAR 1950 9.0 1 Kurtis Kraft Offenhauser
6 Nino Farina FAR 1951 196.0 8 Alfa Romeo
14 Juan Manuel Fangio FAN 1951 348.0 12 Alfa Romeo
In [10]:
#check for min and macx values to set the x and y range in plot
min_pts = df_final['Wins'].min()
min_pts
Out[10]:
1
In [11]:
max_wins = df_final['Wins'].max()
max_wins
Out[11]:
946

I have to use some kind of normalization because for the values of column PTS,there is a wide range of values from 6 to 4225925. This causes the scatter plot bubbles to appear very tiny. I tried using log as well as it can help in visually compressing the data, but that didnt help either.

In [15]:
from sklearn.preprocessing import MinMaxScaler

# Perform Min-Max normalization on the 'Wins' column
scaler = MinMaxScaler(feature_range=(0, 1))
df_final['PTS_normalized'] = scaler.fit_transform(df_final['PTS'].values.reshape(-1, 1))

df_final['Wins_normalized'] = scaler.fit_transform(df_final['Wins'].values.reshape(-1, 1))
In [16]:
maxwins=df_final['Wins'].max()
maxwins
Out[16]:
946
In [17]:
# Create the scatter plot with the normalized 'Wins' column
fig = px.scatter(df_final, x='PTS_normalized', y='Wins_normalized', animation_frame='Year', animation_group='Winner',
                 size='PTS_normalized', color='Winner', hover_name='Car', range_y=[0, 1], range_x=[0, 1])


fig.update_layout(xaxis_title="PTS", yaxis_title="Wins", legend_title="Winner")
# Set figure layout
#fig.update_layout(title='Drivers with Most Wins and PTS Over the Years',
#                  xaxis_title='Total points', yaxis_title='Year')

# Show the figure
fig.show()

You can see very tiny scatter plots from 1950 to 1990, then we clearly some bubbles, two of them very big, these drivers are Michael Schumacher and Lewis Hamilton, they almost standout very loudly from the rest of the drivers.

I will now focus on the data after year 1990. And also zoom in to view the left bottom corner of the previous visual.

In [18]:
filtered_data = df_final[df_final['Year'] >= 1990]
In [19]:
filtered_data.head()
Out[19]:
Winner Year PTS Wins Car PTS_normalized Wins_normalized
1712 Thierry Boutsen BOU 1990 213.0 6 Williams Renault 0.000677 0.005291
1743 Nelson Piquet PIQ 1990 2678.0 64 Benetton Ford 0.008745 0.066667
1727 Riccardo Patrese PAT 1990 208.0 7 Williams Renault 0.000661 0.006349
1717 Alain Prost PRO 1990 8511.5 135 Ferrari 0.027837 0.141799
1706 Nigel Mansell MAN 1990 2331.0 45 Ferrari 0.007609 0.046561
In [20]:
#also noticed that legend doesnt display all driver, probably cause the high number of drivers. Lets take only the top 15 drivers for the next visual
unique_drivers = filtered_data['Winner'].nunique()
unique_drivers
#figured out that the legend was missing drivers because of the same color being assigned to more than one drivers
Out[20]:
43
In [21]:
top_drivers = filtered_data['Winner'].value_counts().head(20).index.tolist()
df_top_15 = filtered_data[filtered_data['Winner'].isin(top_drivers)]
df_top_15
Out[21]:
Winner Year PTS Wins Car PTS_normalized Wins_normalized
1706 Nigel Mansell MAN 1990 2331.0 45 Ferrari 0.007609 0.046561
1622 Ayrton Senna SEN 1990 7966.0 110 McLaren Honda 0.026051 0.115344
1794 Gerhard Berger BER 1991 880.0 28 McLaren Honda 0.002860 0.028571
1652 Ayrton Senna SEN 1991 10584.0 145 McLaren Honda 0.034619 0.152381
1749 Nigel Mansell MAN 1991 3296.0 60 Williams Renault 0.010767 0.062434
... ... ... ... ... ... ... ...
4945 Sergio Perez PER 2021 367.0 3 Red Bull Racing Honda 0.001181 0.002116
4948 Daniel Ricciardo RIC 2021 3646.0 20 McLaren Mercedes 0.011913 0.020106
4654 Lewis Hamilton HAM 2021 305558.0 946 Mercedes 1.000000 1.000000
4950 Max Verstappen VER 2022 21790.5 68 Red Bull Racing RBPT 0.071296 0.070899
4965 Sergio Perez PER 2022 977.0 5 Red Bull Racing RBPT 0.003178 0.004233

122 rows × 7 columns

In [45]:
# Set the color_discrete_sequence with a list of colors
color_sequence = px.colors.qualitative.Set1[:30]  # Adjust the number of colors as needed
fig = px.scatter(filtered_data, x='PTS_normalized', y='Wins_normalized', animation_frame='Year', animation_group='Winner',
                 size='PTS_normalized', color='Winner', hover_name='Car', range_y=[0, 1], range_x=[0, 1],
                 labels={'PTS_normalized': 'PTS', 'Wins_normalized': 'Wins'},
                 title='Top 15 Drivers: PTS vs Wins (1990-2020)',
                 color_discrete_sequence=color_sequence)
                 
fig.update_layout(xaxis_title="PTS", yaxis_title="Wins", legend_title="Winner")

fig.show()

Now, the data is visible as compared to the first one. we can pause and see the bubbles for details by hovering over it. I am still facing issues while assigning colors to the drivers.

In [24]:
#Most successful drivers since 1950 -2020

df_winners = df_race.groupby(['Winner', 'Car']).size().reset_index(name='Wins')
df_winners = df_winners.sort_values(by=['Wins'], ascending=False)
df_winners = df_winners[df_winners['Wins'] >= 7]
fig = px.scatter(df_winners, x="Wins", y="Winner", color="Car", size="Wins", hover_name="Car",
                 hover_data={"Wins": True},
                 title="Most Successful Drivers and Teams in Terms of Race Wins",
                 labels={"Wins": "Race Wins", "Winner": "Driver"})
fig.update_layout(xaxis_title="Race Wins", yaxis_title="Driver", legend_title="Car")
fig.show()
In [25]:
fig.write_html('top_drivers_animation.html', auto_open=False)  # Save as HTML file
In [26]:
constructors= pd.read_csv('constructors.csv')
constructors.head()


constructors= constructors[['name','nationality']]
In [27]:
 count= constructors.groupby( [ "nationality","name"] ).size().reset_index()
In [28]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
constructors_count = constructors['nationality'].value_counts()
data = [go.Bar(
    x = constructors_count.index,
    y = constructors_count.values,
   # marker = dict(color = random_colors(25))
)]
layout = dict(
         title= "Constructors by Country "
)
fig = dict( data=data, layout=layout )
iplot( fig, validate=False )

Most race constructor teams have been british, followed by American, Italian and then the number really drops from French, German and the others following behind.

In [29]:
constructors_count = constructors['nationality'].value_counts()
trace = go.Pie(labels=constructors_count.index, values=constructors_count.values, hole=0.6,textinfo= "none")
layout = go.Layout(
    title='Percentage of Constructors by Nationality'
)
fig = go.Figure(data=[trace], layout=layout)
iplot(fig, filename="plotting-library")

Plotting the very recent British GrandProx results which happened this sunday on 9th July 2023. Visualing the fastest cars during each laps, focusing on the top 5 drivers.

In [30]:
laptimes = pd.read_csv('lap_times.csv')
laptimes.head()
Out[30]:
raceId driverId lap position time milliseconds
0 841 20 1 1 1:38.109 98109
1 841 20 2 1 1:33.006 93006
2 841 20 3 1 1:32.713 92713
3 841 20 4 1 1:32.803 92803
4 841 20 5 1 1:32.342 92342
In [31]:
drivers = pd.read_csv('drivers.csv')
drivers.head()
Out[31]:
driverId driverRef number code forename surname dob nationality url
0 1 hamilton 44 HAM Lewis Hamilton 1985-01-07 British http://en.wikipedia.org/wiki/Lewis_Hamilton
1 2 heidfeld \N HEI Nick Heidfeld 1977-05-10 German http://en.wikipedia.org/wiki/Nick_Heidfeld
2 3 rosberg 6 ROS Nico Rosberg 1985-06-27 German http://en.wikipedia.org/wiki/Nico_Rosberg
3 4 alonso 14 ALO Fernando Alonso 1981-07-29 Spanish http://en.wikipedia.org/wiki/Fernando_Alonso
4 5 kovalainen \N KOV Heikki Kovalainen 1981-10-19 Finnish http://en.wikipedia.org/wiki/Heikki_Kovalainen
In [32]:
results = pd.read_csv('results.csv')
results.head()
Out[32]:
resultId raceId driverId constructorId number grid position positionText positionOrder points laps time milliseconds fastestLap rank fastestLapTime fastestLapSpeed statusId
0 1 18 1 1 22 1 1 1 1 10.0 58 1:34:50.616 5690616 39 2 1:27.452 218.300 1
1 2 18 2 2 3 5 2 2 2 8.0 58 +5.478 5696094 41 3 1:27.739 217.586 1
2 3 18 3 3 7 7 3 3 3 6.0 58 +8.163 5698779 41 5 1:28.090 216.719 1
3 4 18 4 4 5 11 4 4 4 5.0 58 +17.181 5707797 58 7 1:28.603 215.464 1
4 5 18 5 1 23 3 5 5 5 4.0 58 +18.014 5708630 43 1 1:27.418 218.385 1
In [51]:
# Merge the laptimes and drivers datasets based on driverId
merged_data = pd.merge(laptimes, drivers, on='driverId')

merged_data['time_seconds'] = merged_data['milliseconds'] / 1000

merged_data=merged_data[merged_data['raceId'] == 1108]


merged_filtered_data = merged_data[(merged_data['time_seconds'] >= 92) & (merged_data['time_seconds'] <= 95)]

# Group the data by driverId and create a line plot for each driver
plt.figure(figsize=(12, 6))

grouped_data = merged_filtered_data.groupby('driverId')
for driverId, group in grouped_data:
    driver_surname = group['surname'].iloc[0]
    plt.plot(group['lap'],group['time_seconds'],  label=driver_surname)

# Set labels and title
plt.ylabel('Time (seconds)')
plt.xlabel('Lap')
plt.title('Lap Times (92-95 seconds)')
plt.legend(title='Driver',loc='best')
# Show the plot
plt.show()

A lot of lines,as the plot is trying to visualise laptime for 20 drivers currently, lets focus on the fastest 5 drivers from the set.

In [34]:
# Filter the data for raceId 1108
race_1108 = results[results['raceId'] == 1108]

# Sort the data by positionOrder in ascending order
sorted_data = race_1108.sort_values('positionOrder')

# Get the top 5 drivers
top_5_drivers = sorted_data.head(5)
In [35]:
# Merge the datasets based on the common columns
merged_data = pd.merge(merged_data,top_5_drivers, on=['raceId', 'driverId'])
merged_data
Out[35]:
raceId driverId lap position_x time_x milliseconds_x driverRef number_x code forename ... positionOrder points laps time_y milliseconds_y fastestLap rank fastestLapTime fastestLapSpeed statusId
0 1108 1 1 8 1:40.664 100664 hamilton 44 HAM Lewis ... 3 15.0 52 +6.783 5123721 43 3 1:30.545 234.221 1
1 1108 1 2 8 1:34.361 94361 hamilton 44 HAM Lewis ... 3 15.0 52 +6.783 5123721 43 3 1:30.545 234.221 1
2 1108 1 3 8 1:33.662 93662 hamilton 44 HAM Lewis ... 3 15.0 52 +6.783 5123721 43 3 1:30.545 234.221 1
3 1108 1 4 8 1:33.086 93086 hamilton 44 HAM Lewis ... 3 15.0 52 +6.783 5123721 43 3 1:30.545 234.221 1
4 1108 1 5 8 1:33.718 93718 hamilton 44 HAM Lewis ... 3 15.0 52 +6.783 5123721 43 3 1:30.545 234.221 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
255 1108 857 48 4 1:31.159 91159 piastri 81 PIA Oscar ... 4 12.0 52 +7.776 5124714 41 4 1:30.850 233.435 1
256 1108 857 49 4 1:31.422 91422 piastri 81 PIA Oscar ... 4 12.0 52 +7.776 5124714 41 4 1:30.850 233.435 1
257 1108 857 50 4 1:31.752 91752 piastri 81 PIA Oscar ... 4 12.0 52 +7.776 5124714 41 4 1:30.850 233.435 1
258 1108 857 51 4 1:31.305 91305 piastri 81 PIA Oscar ... 4 12.0 52 +7.776 5124714 41 4 1:30.850 233.435 1
259 1108 857 52 4 1:30.941 90941 piastri 81 PIA Oscar ... 4 12.0 52 +7.776 5124714 41 4 1:30.850 233.435 1

260 rows × 31 columns

In [36]:
merged_data['time_seconds'] = merged_data['milliseconds_x'] / 1000


merged_filtered_data = merged_data[(merged_data['time_seconds'] >= 92) & (merged_data['time_seconds'] <= 95)]
In [37]:
# Group the data by driverId and create a line plot for each driver
grouped_data = merged_filtered_data.groupby('driverId')
grouped_data
Out[37]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000258B6A82F40>
In [49]:
plt.style.use('dark_background')
plt.figure(figsize=(12, 6))
for driverId, group in grouped_data:
    driver_surname = group['surname'].iloc[0]  # Get the surname of the driver
    plt.plot(group['lap'],group['time_seconds'],  label=driver_surname)

# Set labels and title
plt.ylabel('Time (seconds)')
plt.xlabel('Lap')
plt.title('British GP Lap Times (92-95 seconds)')

plt.legend(title='Driver')

# Show the plot
plt.show()

So we set the x axis with a time range of 92-95 seconds as it is the average time taken by the drivers for covering each lap. We see that Verstappen is very low on the graph, meaning he takes less time than others to cover the laps implying that he is the fastest among them all. Also, notice that Norris was faster than everybody else in the beginning of the race, giving Verstappen a good challenge for most part of the race until the 20th lap, then his car becomes slower and Hamilton gains speed from 26th lap after being the slowest among these 5 drivers for starting part of the race.

In [ ]: